1. Introduction - Marika

In this section, explain why you chose this topic, and the questions you are interested in studying. Include a brief description of how you found the data, and clear instructions on where the reader can find the data.

2. Team - Adam

List team members and a description of how each contributed to the project. (If you’re working alone, briefly describe the stages of the project.)

  1. Adam Coviensky -

  2. Rohan Pitre -

  3. Marika Lohmus -

3. Analysis of Data Quality - All

Zip Code, Neighborhood and Land Value Data - Adam

Sidewalk Cafe License Data - Marika

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(extracat)
require(forcats)
## Loading required package: forcats
library(ggmosaic)
## Loading required package: productplots
## 
## Attaching package: 'ggmosaic'
## The following objects are masked from 'package:productplots':
## 
##     ddecker, hspine, mosaic, prodcalc, spine, vspine
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.3.3
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses.csv",strip.white=TRUE, na.strings=c("","NA"))
Data Quality

One of the most glaring issues in data quality is the naming of the cities in Manhattan and Queens. Capitalization differences like between “NEW YORK” and “New York” grouped cafes in the same city to be categorized differently. Similarly, cities in Queens had some abbreviation differences like between “LONG ISLAND CITY” and “LONG IS CITY” or “JACKSON HEIGHTS” and “JACKSON HTS”. I manually replaced all abbreviated or non-capitalized cities with their longer, capitalized forms.

sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"

The second data quality issue I encountered was the longitude and latitude of two restaurants. Plotting all restaurants using qmplot, I got the following result:

qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))
## Using zoom = 9...
## Map from URL : http://tile.stamen.com/toner-lite/9/145/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/146/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/147/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/148/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/149/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/150/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/151/192.png
## Map from URL : http://tile.stamen.com/toner-lite/9/145/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/146/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/147/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/148/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/149/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/150/193.png
## Map from URL : http://tile.stamen.com/toner-lite/9/151/193.png
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

Note that there is a mysterious dot all the way west of Harrisburg, PA. This should not be the case since the data should only apply to the City and boroughs of New York, so I plotted the longitude and latitude to see any outliers.

lat<-ggplot(sidewalks, aes(x=LATITUDE))+geom_histogram(binwidth=.01)+ggtitle("Latitude Histogram")
long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.01)+ggtitle("Longitude Histogram")
grid.arrange(lat,long,nrow=2)

You can barely see some points beow the 40.25 latitude and -77 longitude, which seem to be quite off from the rest. Next, I zoomed in on those values:

lat<-ggplot(sidewalks, aes(x=LATITUDE))+geom_histogram(binwidth=.25)+ggtitle("Latitude Histogram")+xlim(39,40.25)
long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.25)+ggtitle("Longitude Histogram")+xlim(-80,-77)
grid.arrange(lat,long,nrow=2)
## Warning: Removed 1589 rows containing non-finite values (stat_bin).

## Warning: Removed 1589 rows containing non-finite values (stat_bin).

There are two datapoints that have an abnormally low Latitude and Longitude as compared to the rest of the data. Taking a look at these data points, I identified them as the following businesses:

sidewalks %>% filter(LONGITUDE < -77) %>% select(BUSINESS_NAME2,LONGITUDE,LATITUDE)
##              BUSINESS_NAME2 LONGITUDE LATITUDE
## 1       MULBERRY STREET BAR -77.51958 40.11239
## 2 PRIME ONE 16 E.HARLEM NYC -77.51958 40.11239

Looking these locations up on Google, it looks like there was an error in entering their longitude and latitude. The correct values are (40.72017,-73.9968) for Mulberry Street Bar and (40.79593,-73.9357) for Prime One 16. I have corrected these values in another CSV, which I will use from here on.

sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses2.csv",strip.white=TRUE, na.strings=c("","NA"))
sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"

qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))
## Using zoom = 12...
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1536.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1536.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1536.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1536.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1537.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1537.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1537.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1537.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1538.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1538.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1538.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1538.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1539.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1539.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1539.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1539.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1540.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1540.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1540.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1540.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1541.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1541.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1541.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1541.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1205/1542.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1206/1542.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1207/1542.png
## Map from URL : http://tile.stamen.com/toner-lite/12/1208/1542.png
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

Now that the offending datapoints have been fixed, the qmplot accurately zooms in on the New York area. However, since those two data points exist, it is entirely possible that there are other mis-mapped langitude and longitude points. However, without manually going through each data point, it is not easy to deduce where.

Missing Data

To further investigate data quality, I took a look at the main columns that identified a business - its license number, license status, business name (broken into two), building number, street, city, state and zip. To display this data, I used a visna plot which highlights any columns with missing values and shows the frequency of the combinations of those columns.

visna(sidewalks[,1:9])

The visna shows that only two variables have missing values - license number and business name number 2. These are expected - only licenses that have been approved would have a license number, and some businesses do not need to use a second line for their business name.

Next, I looked at additional application information which includes the sidewalk cafe type, the square footage requested, number of tables, number of chairs, Department of Health and Mental Hygiene (DOHMH) identification number, the restaurant’s longitude and latitude, the community district and city council district it belongs to, and the URL for the website of the NYC Community District.

visna(sidewalks[,10:19])

Two variables have missing values - the square footage of the sidewalk cafe and the Department of Health and Mental Hygene identification number. These are items that the restaurant would have to provide during their application process, and may not have been available at submission. These missing values should not impede with our analysis. However, if we wanted to cross-reference the health grades from the DOHMH datasets, we would have issues with the missing ID numbers.

The next set of columns to analyze is the set of application-specific fields. These include the application ID, and the sidewalk cafe type, square footage, number of tables and chairs provided by the applicant. It has the app status, the date at which the app status was reached, expiration date, the expiration date of the temporary operating order(“TOO” - if available), the application submit date, and whether the application has been received.

visna(sidewalks[,20:29])

There seem to be a few missing fields in the provided sidewalk cafe squarefootage data. However, we will not be focusing on this data point in our analysis. Expiration dates are also missing from certain licenses. Filtering out those licenses and looking at thei APP_STATUS, we can see that those licenses that are in review may not have an assigned expiration status. Indeed, this makes sense with new applications that have not been given an expiration date.

missing_expiration<-sidewalks %>% filter(is.na(EXPIRATION_DATE)) %>% select(APP_STATUS)
ggplot(missing_expiration,aes(x=APP_STATUS))+geom_bar()+ggtitle("In Review licenses have a missing expiration date")

A similar explanation arises about the applications with a missing Temporary Operating Order (“TOO”) date - only those applications that have been granted a TOO in cases where the old license has expired but the renewal is in the process of being reviewed.

The remaining colums track the status of the license approval process through various stages and will not be used in this analysis. Therefore, I will not spend time on investigating any missing data.

master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv")

Liquor License Data - Rohan

4. Executive Summary - All

Interactive Map of New York City

Provide a short nontechnical summary of the most revealing findings of your analysis with no more than 3 static graphs or one interactive graph (or link), written for a nontechnical audience. The length should be approximately 2 pages (if we were using pages…) Do not show code, and take extra care to clean up your graphs, ensuring that best practices for presentation are followed. . Note: the tips below are not intended to be a complete list of everything we’ve covered this semester on designing a successful graph. It’s meant to help you avoid some common problems. . Title, axis labels, tick mark labels, and legends should be comprehensible (easy to understand) and legible (easy to read / decipher). . Tick marks should not be labeled in scientific notation or with long strings of zeros, such as 3000000000. Instead, convert to smaller numbers and change the units: 3000000000 becomes “3” and the axis label “billions of views”. . Units should be intuitive (Extreme example: an axis labeled in month/day/year format is intuitive, one labeled in seconds since January 1, 1970 is not.) . The font size should be large enough to read clearly. The default in ggplot2 is generally too small. You can easily change it by passing the base font size to the theme, such as + theme_grey(16). (The default base font size is 11.) . The order of items on the axes and legends is logical. (Alphabetical is often not logical.) . Colors should be color vision deficiency friendly. . If a legend is taking up too much space on the right, move it to the bottom. . If categorical variable levels are long, set up the graph so the categorical variable is on the y-axis and the names are horizontal. A better option, if possible, is to shorten the names of the levels. . Not all EDA graphs lend themselves to presentation, either because the graph form is hard to understand without practice or it’s not well labeled. The labeling problem can be solved by adding text in an image editor. The downside is that it is not reproducible. If you want to go this route, Paintbrush is a free and simple bitmap image editor for the Mac: https://paintbrush.sourceforge.io/ There are many other options.

5. Main Analysis - All

Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc. . The guidelines for the Executive Summary above do NOT apply to exploratory data analysis. Your main concern is designing graphs that reveal patterns and trends. . As noted in Hmk #4, do not use circles, that is: bubbles, pie charts, or polar coordinates. . Use stacked bar charts sparingly. Try grouped bar charts and faceting as alternatives, and only choose stacked bar charts if they truly do a better job than the alternatives for observing patterns.

Zip Code, Neighborhood and Land Value Data - Adam

Sidewalk Cafe License Data - Marika

Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New Yor City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.

First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.

sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))

To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.

ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")

Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:

queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()

In queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.

Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.

sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")

The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.

sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))

Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.

ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")

The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:

ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")

Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.

We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.

map <- get_map( location = c(-73.9485424, 40.7454513),  source = "stamen", zoom = 11, maptype="toner") 
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.745451,-73.948542&zoom=11&size=640x640&scale=2&maptype=terrain&sensor=false
## Map from URL : http://tile.stamen.com/toner/11/602/768.png
## Map from URL : http://tile.stamen.com/toner/11/603/768.png
## Map from URL : http://tile.stamen.com/toner/11/604/768.png
## Map from URL : http://tile.stamen.com/toner/11/602/769.png
## Map from URL : http://tile.stamen.com/toner/11/603/769.png
## Map from URL : http://tile.stamen.com/toner/11/604/769.png
## Map from URL : http://tile.stamen.com/toner/11/602/770.png
## Map from URL : http://tile.stamen.com/toner/11/603/770.png
## Map from URL : http://tile.stamen.com/toner/11/604/770.png
## Map from URL : http://tile.stamen.com/toner/11/602/771.png
## Map from URL : http://tile.stamen.com/toner/11/603/771.png
## Map from URL : http://tile.stamen.com/toner/11/604/771.png

Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.

ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)
## Warning: Removed 3 rows containing missing values (geom_point).

g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")
## Warning: Removed 3 rows containing non-finite values (stat_density2d).

ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
## Warning: Removed 3 rows containing missing values (geom_point).

new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION=="ACTIVE" | STATUS_CLASSIFICATION=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Active and New Licenses in New York City")
## Warning: Removed 2 rows containing missing values (geom_point).

bronx_map <- get_map("Bronx, NY",  source = "stamen", zoom = 12, maptype="toner") 
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Bronx,+NY&zoom=12&size=640x640&scale=2&maptype=terrain&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Bronx,%20NY&sensor=false
## Map from URL : http://tile.stamen.com/toner/12/1206/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1536.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1537.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1538.png
## Map from URL : http://tile.stamen.com/toner/12/1206/1539.png
## Map from URL : http://tile.stamen.com/toner/12/1207/1539.png
## Map from URL : http://tile.stamen.com/toner/12/1208/1539.png
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Active and New Licenses in the Bronx")
## Warning: Removed 931 rows containing missing values (geom_point).

Liquor License Data - Rohan

6. Conclusion - Rohan